Creating Appropriate Corpus for Information Retrieval and Natural Language Processing in Persian Language
Authors
Abstract
Persian natural language processing (NLP) researchers have limited access to linguistic tools suitable for text processing, and research in Persian text processing is therefore very limited. Since datasets are an important requirement for experiments and their evaluation, we aimed to create appropriate corpora for information retrieval and natural language processing in Persian. The corpora provided in this article are based on the Hamshahri dataset, which is suitable only for simple information retrieval and simple natural language processing because it has not been tagged. We converted this dataset into a tagged collection and improved its text quality. The new corpora minimize the text preprocessing required. We used the STeP-1 tools for text processing and proposed some ideas for removing bugs in these tools in order to improve their quality. Finally, we used the new corpora for text retrieval, and the results showed improved performance.
Similar resources
Applying Natural Language Processing Techniques for Effective Persian-English Cross-Language Information Retrieval
Much attention has recently been paid to natural language processing in information storage and retrieval. This paper describes how the application of natural language processing (NLP) techniques can enhance cross-language information retrieval (CLIR). Using a semi-experimental technique, we took Farsi queries to retrieve relevant documents in English. For translating Persian queries, we used a...
Arabic Natural Language Processing for Information Retrieval
Human Language Technology has played a big role in implementing Latin-based information retrieval systems. Two of the most cited techniques are stemming and truncation. Numerous studies have shown that the inflectional structure of words has a big impact on the retrieval accuracy of information retrieval systems (IRS) for Latin-based languages. Stemming or truncation is done for two principal reas...
Natural Language Processing in Information Retrieval
Many Natural Language Processing (NLP) techniques have been used in Information Retrieval. The results are not encouraging. Simple methods (stopwording, porter-style stemming, etc.) usually yield significant improvements, while higher-level processing (chunking, parsing, word sense disambiguation, etc.) only yield very small improvements or even a decrease in accuracy. At the same time, higher-...
Information Retrieval and Trainable Natural Language Processing
Existing work on indexing and retrieving documents from large on-line collections has had great success at treating both documents and queries as simple, unstructured collections of individual words (terms) with dependencies among these terms largely ignored. However, natural language text has a great deal of structure. In particular, at a scale close to that of the individual word, there are i...
Information Retrieval Using Robust Natural Language Processing
We developed a fully automated Information Retrieval System which uses advanced natural language processing techniques to enhance the effectiveness of traditional key-word based document retrieval. In early experiments with the standard CACM-3204 collection of abstracts, the augmented system has displayed capabilities that made it clearly superior to the purely statistical base system. 1. O V E...
Journal title:
International Journal of Information Science and Management, Volume 13, Issue 2, pages 0-0